1/31/2024
Chapter 2 - Modeling Process
- Data splitting (typical splits are 60% training / 40% testing, 70%/30%, or 80%/20%; see the splitting sketch after the sampling bullets below)
- Sampling methods
  - Simple random sample (you can use base R, the caret package, the rsample package, or the h2o package)
  - Stratified sampling (again, a variety of methods can be used)
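A minimal sketch of a 70/30 split with rsample, both simple random and stratified. The `ames` data frame and its `Sale_Price` column are assumptions here (the Ames housing data the book uses); any data frame with a numeric response works the same way.

```r
library(rsample)

set.seed(123)  # make the split reproducible

# Simple random sample: 70% training / 30% testing
split_simple <- initial_split(ames, prop = 0.7)
train_simple <- training(split_simple)
test_simple  <- testing(split_simple)

# Stratified sample: sample within quantiles of the response so the
# training and test sets have similar Sale_Price distributions
split_strat <- initial_split(ames, prop = 0.7, strata = "Sale_Price")
train_strat <- training(split_strat)
test_strat  <- testing(split_strat)
```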
- Formula interfaces (the R formula interface is common, but there are other specification styles; make sure you understand what a given function expects before running the R code)
- Different functions use different engines - e.g., lm(), glm(), and train() each expect a slightly different specification (see the sketch below)
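A rough sketch of the same linear model specified through three different engines. The predictor names (`Gr_Liv_Area`, `Year_Built`) and the `train_strat` data frame carried over from the split sketch above are illustrative assumptions.

```r
library(caret)

# Base R least squares, formula interface
fit_lm  <- lm(Sale_Price ~ Gr_Liv_Area + Year_Built, data = train_strat)

# glm() takes the same formula, but the error distribution is stated explicitly
fit_glm <- glm(Sale_Price ~ Gr_Liv_Area + Year_Built,
               data = train_strat, family = gaussian())

# caret::train() wraps many engines; the engine is chosen with `method`
fit_caret <- train(Sale_Price ~ Gr_Liv_Area + Year_Built,
                   data = train_strat, method = "lm")
```

The formula is the same in all three calls; what differs is the extra arguments each function needs, so check the help page before fitting.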
- Resampling methods (a resampling sketch follows this group)
  - k-fold cross-validation
  - Bootstrapping
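A minimal sketch of both resampling schemes, again assuming the `train_strat` data frame from above. rsample returns resample objects you iterate over yourself, while caret's `trainControl()` object is passed to `train()`.

```r
library(rsample)
library(caret)

set.seed(123)

# 10-fold cross-validation and 100 bootstrap resamples with rsample
cv_folds <- vfold_cv(train_strat, v = 10)
boots    <- bootstraps(train_strat, times = 100)

# The same resampling strategies expressed for caret::train()
cv_ctrl   <- trainControl(method = "cv", number = 10)
boot_ctrl <- trainControl(method = "boot", number = 100)
```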
- Bias-variance trade-off
- Hyperparameter tuning - more on this later when we get to the detailed methods
- Model evaluation (worked sketches of these metrics follow this list)
  - Regression models
    - MSE - mean squared error
    - RMSE - root mean squared error
    - Deviance - mean residual deviance
    - MAE - mean absolute error
    - RMSLE - root mean squared logarithmic error
    - R^2 - coefficient of determination
  - Classification models
    - Misclassification - overall error rate
    - Mean per class error
    - MSE - mean squared error
    - Cross entropy - log loss or deviance
    - Gini index - mainly used with tree-based methods
    - Accuracy
    - Precision
    - Sensitivity - recall
    - Specificity
    - AUC - area under the ROC curve
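Worked versions of the regression metrics, hand-rolled so the formulas are visible. `y` (observed values) and `pred` (model predictions) are assumed numeric vectors, non-negative where RMSLE is used.

```r
mse   <- mean((y - pred)^2)                           # mean squared error
rmse  <- sqrt(mse)                                    # root mean squared error
mae   <- mean(abs(y - pred))                          # mean absolute error
rmsle <- sqrt(mean((log1p(pred) - log1p(y))^2))       # root mean squared log error
r2    <- 1 - sum((y - pred)^2) / sum((y - mean(y))^2) # coefficient of determination
# For a Gaussian model, the mean residual deviance is the same quantity as MSE.
```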
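A similar sketch for the classification metrics, built from a 2x2 confusion matrix. `truth` and `pred_class` are assumed factors with levels "no"/"yes", `pred_prob` is the predicted probability of "yes", and the pROC package used for AUC is also an assumption.

```r
cm <- table(Predicted = pred_class, Actual = truth)
TP <- cm["yes", "yes"]; TN <- cm["no", "no"]
FP <- cm["yes", "no"];  FN <- cm["no", "yes"]

accuracy    <- (TP + TN) / sum(cm)  # overall fraction correct
misclass    <- 1 - accuracy         # misclassification (overall error) rate
precision   <- TP / (TP + FP)       # share of predicted "yes" that were right
sensitivity <- TP / (TP + FN)       # recall / true positive rate
specificity <- TN / (TN + FP)       # true negative rate

# Cross entropy (log loss)
logloss <- -mean(ifelse(truth == "yes", log(pred_prob), log(1 - pred_prob)))

# AUC - area under the ROC curve, via pROC
library(pROC)
auc_value <- auc(roc(truth, pred_prob))
```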
Reading assignment - Please read Chapter 3 - Feature & Target Engineering.